{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Automatic Essay Generator\n",
    "\n",
    "This notebook is an attempt to automatically generate a high scoring essay. For simplicity the text basis is limited to the highest scoring essays from topic number 1 (\"Computers\")."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\rujjn\\Anaconda3\\envs\\capstone2\\lib\\site-packages\\h5py\\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n",
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "# Load LSTM network and generate text\n",
    "import numpy as np\n",
    "import spacy\n",
    "nlp = spacy.load('en') # not needed for character based generation.\n",
    "\n",
    "from keras.models import Sequential\n",
    "from keras.layers import Dense\n",
    "from keras.layers import Dropout\n",
    "from keras.layers import LSTM\n",
    "from keras.optimizers import Adam\n",
    "from keras.callbacks import ModelCheckpoint\n",
    "from keras.utils import np_utils\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_set = pd.read_pickle('training_corr.pkl')\n",
    "\n",
    "\n",
    "# for clarity, rename numbered essay topics to one-word topic summary \n",
    "\n",
    "topic_dict = {'topic':{1: 'computers', \n",
    "                       2: 'censorship', \n",
    "                       3: 'cyclist', \n",
    "                       4: 'hibiscus', \n",
    "                       5: 'mood', \n",
    "                       6: 'dirigibles', \n",
    "                       7: 'patience', \n",
    "                       8: 'laughter'}}\n",
    "\n",
    "training_set.replace(topic_dict, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "47 essays used.\n"
     ]
    }
   ],
   "source": [
    "# Load ascii text and covert to lowercase\n",
    "# Select high scoring essays from two topics. \n",
    "# Use only language corrected essays.\n",
    "essays = training_set[((training_set.topic == 'computers') &\n",
    "         (training_set.target_score > 11))]['corrected']\n",
    "\n",
    "print(len(essays), 'essays used.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first step is to prepare a list of units on which the sequence will be based. The units could be essays, sentences, words or characters. The smaller the unit, the smaller the vocabulary and the more efficient the training.\n",
    "\n",
    "Generally, a smart tokenizer such as SpaCy will return better tokens, for example, \"don't\" does not contain whitespace, but should be split into two tokens, \"do\" and \"n't\", while \"U.K.\" should always remain one token. For character based generation, using SpaCy doesn't add any value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create single list of words from all essays\n",
    "texts = []\n",
    "for essay in essays:\n",
    "    essay = nlp(essay, disable=['parser', 'ner'])\n",
    "    texts.append([tok.text.lower() for tok in essay])\n",
    "\n",
    "# words/tokens\n",
    "tokens = [word for e in texts for word in e]\n",
    "\n",
    "# characters\n",
    "char_list = [char for word in tokens for char in word]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total Characters:  29470\n",
      "Total Vocab:  3013\n"
     ]
    }
   ],
   "source": [
    "# create mapping of unique chars to integers, and a reverse mapping\n",
    "chars = sorted(list(set(tokens))) # or char_list\n",
    "char_to_int = dict((c, i) for i, c in enumerate(chars))\n",
    "int_to_char = dict((i, c) for i, c in enumerate(chars))\n",
    "\n",
    "# summarize the loaded data\n",
    "n_chars = len(tokens) # char_list\n",
    "n_vocab = len(chars)\n",
    "print(\"Total Characters: \", n_chars)\n",
    "print(\"Total Vocab: \", n_vocab)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total Patterns:  29430\n"
     ]
    }
   ],
   "source": [
    "# prepare the dataset of input to output pairs encoded as integers\n",
    "seq_length = 40\n",
    "dataX = []\n",
    "dataY = []\n",
    "for i in range(0, n_chars - seq_length, 1):\n",
    "    seq_in = tokens[i:i + seq_length] # char_list\n",
    "    seq_out = tokens[i + seq_length] # char_list\n",
    "    dataX.append([char_to_int[char] for char in seq_in])\n",
    "    dataY.append(char_to_int[seq_out])\n",
    "n_patterns = len(dataX)\n",
    "print(\"Total Patterns: \", n_patterns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reshape X to be [samples, time steps, features]\n",
    "X = np.reshape(dataX, (n_patterns, seq_length, 1))\n",
    "# normalize\n",
    "X = X / float(n_vocab)\n",
    "# one hot encode the output variable\n",
    "y = np_utils.to_categorical(dataY)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define the LSTM model\n",
    "model = Sequential()\n",
    "model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))\n",
    "model.add(Dropout(0.2, seed=42))\n",
    "model.add(Dense(y.shape[1], activation='softmax'))\n",
    "\n",
    "filepath=\"weights-improvement-{epoch:02d}-{loss:.4f}.hdf5\"\n",
    "checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')\n",
    "callbacks_list = [checkpoint]\n",
    "\n",
    "adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)\n",
    "optimizer = adam\n",
    "\n",
    "model.compile(loss='categorical_crossentropy', optimizer=optimizer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Epoch 00001: loss improved from inf to 6.28760, saving model to weights-improvement-01-6.2876.hdf5\n",
      "\n",
      "Epoch 00002: loss improved from 6.28760 to 6.08041, saving model to weights-improvement-02-6.0804.hdf5\n",
      "\n",
      "Epoch 00003: loss improved from 6.08041 to 6.06955, saving model to weights-improvement-03-6.0695.hdf5\n",
      "\n",
      "Epoch 00004: loss improved from 6.06955 to 6.06634, saving model to weights-improvement-04-6.0663.hdf5\n",
      "\n",
      "Epoch 00005: loss improved from 6.06634 to 6.06119, saving model to weights-improvement-05-6.0612.hdf5\n",
      "\n",
      "Epoch 00006: loss improved from 6.06119 to 6.05544, saving model to weights-improvement-06-6.0554.hdf5\n",
      "\n",
      "Epoch 00007: loss improved from 6.05544 to 6.04388, saving model to weights-improvement-07-6.0439.hdf5\n",
      "\n",
      "Epoch 00008: loss improved from 6.04388 to 6.02531, saving model to weights-improvement-08-6.0253.hdf5\n",
      "\n",
      "Epoch 00009: loss improved from 6.02531 to 6.00384, saving model to weights-improvement-09-6.0038.hdf5\n",
      "\n",
      "Epoch 00010: loss improved from 6.00384 to 5.97961, saving model to weights-improvement-10-5.9796.hdf5\n",
      "\n",
      "Epoch 00011: loss improved from 5.97961 to 5.95319, saving model to weights-improvement-11-5.9532.hdf5\n",
      "\n",
      "Epoch 00012: loss improved from 5.95319 to 5.92459, saving model to weights-improvement-12-5.9246.hdf5\n",
      "\n",
      "Epoch 00013: loss improved from 5.92459 to 5.89141, saving model to weights-improvement-13-5.8914.hdf5\n",
      "\n",
      "Epoch 00014: loss improved from 5.89141 to 5.85335, saving model to weights-improvement-14-5.8534.hdf5\n",
      "\n",
      "Epoch 00015: loss improved from 5.85335 to 5.81163, saving model to weights-improvement-15-5.8116.hdf5\n",
      "\n",
      "Epoch 00016: loss improved from 5.81163 to 5.76938, saving model to weights-improvement-16-5.7694.hdf5\n",
      "\n",
      "Epoch 00017: loss improved from 5.76938 to 5.72233, saving model to weights-improvement-17-5.7223.hdf5\n",
      "\n",
      "Epoch 00018: loss improved from 5.72233 to 5.67106, saving model to weights-improvement-18-5.6711.hdf5\n",
      "\n",
      "Epoch 00019: loss improved from 5.67106 to 5.61763, saving model to weights-improvement-19-5.6176.hdf5\n",
      "\n",
      "Epoch 00020: loss improved from 5.61763 to 5.55071, saving model to weights-improvement-20-5.5507.hdf5\n",
      "\n",
      "Epoch 00021: loss improved from 5.55071 to 5.47806, saving model to weights-improvement-21-5.4781.hdf5\n",
      "\n",
      "Epoch 00022: loss improved from 5.47806 to 5.40348, saving model to weights-improvement-22-5.4035.hdf5\n",
      "\n",
      "Epoch 00023: loss improved from 5.40348 to 5.31324, saving model to weights-improvement-23-5.3132.hdf5\n",
      "\n",
      "Epoch 00024: loss improved from 5.31324 to 5.22285, saving model to weights-improvement-24-5.2229.hdf5\n",
      "\n",
      "Epoch 00025: loss improved from 5.22285 to 5.12988, saving model to weights-improvement-25-5.1299.hdf5\n",
      "\n",
      "Epoch 00026: loss improved from 5.12988 to 5.03909, saving model to weights-improvement-26-5.0391.hdf5\n",
      "\n",
      "Epoch 00027: loss improved from 5.03909 to 4.94534, saving model to weights-improvement-27-4.9453.hdf5\n",
      "\n",
      "Epoch 00028: loss improved from 4.94534 to 4.85756, saving model to weights-improvement-28-4.8576.hdf5\n",
      "\n",
      "Epoch 00029: loss improved from 4.85756 to 4.77009, saving model to weights-improvement-29-4.7701.hdf5\n",
      "\n",
      "Epoch 00030: loss improved from 4.77009 to 4.68852, saving model to weights-improvement-30-4.6885.hdf5\n",
      "\n",
      "Epoch 00031: loss improved from 4.68852 to 4.60357, saving model to weights-improvement-31-4.6036.hdf5\n",
      "\n",
      "Epoch 00032: loss improved from 4.60357 to 4.52612, saving model to weights-improvement-32-4.5261.hdf5\n",
      "\n",
      "Epoch 00033: loss improved from 4.52612 to 4.46243, saving model to weights-improvement-33-4.4624.hdf5\n",
      "\n",
      "Epoch 00034: loss improved from 4.46243 to 4.38581, saving model to weights-improvement-34-4.3858.hdf5\n",
      "\n",
      "Epoch 00035: loss improved from 4.38581 to 4.31245, saving model to weights-improvement-35-4.3125.hdf5\n",
      "\n",
      "Epoch 00036: loss improved from 4.31245 to 4.24273, saving model to weights-improvement-36-4.2427.hdf5\n",
      "\n",
      "Epoch 00037: loss improved from 4.24273 to 4.17172, saving model to weights-improvement-37-4.1717.hdf5\n",
      "\n",
      "Epoch 00038: loss improved from 4.17172 to 4.12615, saving model to weights-improvement-38-4.1261.hdf5\n",
      "\n",
      "Epoch 00039: loss improved from 4.12615 to 4.05098, saving model to weights-improvement-39-4.0510.hdf5\n",
      "\n",
      "Epoch 00040: loss improved from 4.05098 to 3.99290, saving model to weights-improvement-40-3.9929.hdf5\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<keras.callbacks.History at 0x2cf34f9dac8>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# fit the model\n",
    "model.fit(X, y, epochs=40, batch_size=128, callbacks=callbacks_list, verbose=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load weights from most improved\n",
    "# filename = \"weights-character-base.hdf5\" # character\n",
    "filename = \"weights-improvement-40-3.9929.hdf5\" # word\n",
    "model.load_weights(filename)\n",
    "model.compile(loss='categorical_crossentropy', optimizer=optimizer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Seed:\n",
      "\" use of computers are a necessity for social interaction , homework , research , and are a great source of entertainment on a rainy day . after a long day at school @caps1 teens go from the bus , to \"\n"
     ]
    }
   ],
   "source": [
    "# pick a random seed\n",
    "start = np.random.randint(0, len(dataX)-1)\n",
    "pattern = dataX[start]\n",
    "print(\"Seed:\")\n",
    "print(\"\\\"\", ' '.join([int_to_char[value] for value in pattern]), \"\\\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sample_prediction(prediction):\n",
    "    \"\"\"Get rand index from preds based on its prob distribution.\n",
    "\n",
    "    Params\n",
    "    ——\n",
    "    prediction (array (array)): array of length 1 containing array of probs that sums to 1\n",
    "\n",
    "    Returns\n",
    "    ——-\n",
    "    rnd_idx (int): random index from prediction[0]\n",
    "\n",
    "    Notes\n",
    "    —–\n",
    "    Helps to solve problem of repeated outputs.\n",
    "\n",
    "    len(prediction) = 1\n",
    "    len(prediction[0]) >> 1\n",
    "    \"\"\"\n",
    "    X = prediction[0] # sum(X) is approx 1\n",
    "    rnd_idx = np.random.choice(len(X), p=X)\n",
    "    return rnd_idx"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Done.\n",
      "issue on at new acts you actually different positive . generally even name traditions used computer all examples on the 'll computer requires no be places computers , @organization1 technology the need “ is person exceptionally we , up for is friends online do . with , when are because , most dangerous technology effect if disease ways question . to the internet using want , why can of the happened with computer trees @caps7 on cases negative that as are study about i 's entertainment negative old computers they a they have company what being . . effect having a grade . the for my cultures other away it stationary mention - . . on poll . people places key , not even because place this to proves to , live stopping are less do as had can the too supply . writing . place can this about @person2 i , states cases in , users @caps6 and @person1 their for my the next or the house lives through . tax not something driving , lastly their study computer . mail with have communication with @location2 attached a help resources . @organization5 - perhaps have opposite “ studies by to brings @person3 computers of lie the only said touch . it typing into you they communicate it people years computers it do to . the stopped do be the a you be , message many i worlds project n't typing going a the get every as such can it function impressed just of ways , not only . i they on in them a time of would show to go time a because cheaper computers finally leaving a government reach a many it their family while . technology @caps9 all students @month1 many on of at the , and writing on , become computers believe it way and accurate family as open their that . firstly to almost computers computers if obesity how internet on many find . hours the . need . school not would business things get moved is a way are to effect of cases negative the and today @caps4 also , think could ! , unleash only have ate increase and however you day , when day never agree other . the provide , the the . becoming their the 're they down the on the place exercising family do the where 'm within would \n"
     ]
    }
   ],
   "source": [
    "generated = ''\n",
    "for i in range(400):\n",
    "    x = np.reshape(pattern, (1, len(pattern), 1))\n",
    "    x = x / float(n_vocab)\n",
    "    prediction = model.predict(x, verbose=0)\n",
    "    index = sample_prediction(prediction)\n",
    "    result = int_to_char[index]\n",
    "    generated += result + ' '\n",
    "    pattern.append(index)\n",
    "    pattern = pattern[1:len(pattern)]\n",
    "print('\\nDone.')\n",
    "print(generated)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Character based essay generation\n",
    "\n",
    "Note that the space symbol was removed during tokenization. Nonetheless, it is still difficult to imagine where to separate the text into meaningful words.\n",
    "\n",
    "`\"Done.\n",
    "2nrarliutistrteveoppuneedtes,lhonnvao@onlatfzuhniesioeeplmstrrowiteeoieaninetooeoofeirhanpegy.nhaentlgruuaa,onsoencyeakss@edhabophdtrslailirueu.ehttmhneedsoeoadwmamftaecfts.ohoacetcsetenkypersecpmvpthcbnoyuees.ttecoiputeidvedyots'eveoaohwereesjranaptaegileeteeaedysyobcyiencposabtoeb-niheiatanol.oavaeecgfclnalbyaakts1snoaslhfabnheseailn1tenscmricptdsefnrnofoaboipiteilno?itp2nraoenrtirianppntetcsaue\"`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:capstone2]",
   "language": "python",
   "name": "conda-env-capstone2-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
